Background: High-throughput DNA/RNA sequencing has revolutionized biological and clinical research. Sequencing is widely used and generates very large amounts of data, mainly owing to reduced costs and advanced technologies. Quickly assessing the quality of giga-to-tera base levels of sequencing data has become a routine but important task. Identifying and eliminating low-quality sequence data is crucial for the reliability of downstream analysis results. There is a need for a high-speed tool that uses optimized parallel programming for batch processing and simply gauges the quality of sequencing data from multiple datasets, independent of any other processing steps.
Results: FQStat is a stand-alone, platform-independent software tool that assesses the quality of FASTQ files using parallel programming. Based on the machine architecture and input data, FQStat automatically determines the number of cores and the amount of memory to allocate per file for optimum performance. Our results indicate that in a core-limited case, core assignment overhead exceeds the benefit of additional cores. In a core-unlimited case, performance reaches a saturation point as additional cores are assigned per file. We also show that memory allocation per file has a lower priority for performance than the allocation of cores. FQStat's output is summarized as an HTML web page, a tab-delimited text file, and high-resolution images. FQStat calculates and plots read count, read length, quality score, and high-quality base statistics. FQStat identifies and marks low-quality sequencing data to suggest its removal from downstream analysis. We applied FQStat to real sequencing data to optimize performance and to demonstrate its capabilities. We also compared FQStat's performance to similar quality control (QC) tools that utilize parallel programming and attained improvements in run time.
Conclusions: FQStat is a user-friendly tool with a graphical interface that employs a parallel programming architecture and automatically optimizes its performance to generate quality control statistics for sequencing data. Unlike existing tools, FQStat calculates these statistics for multiple datasets and separately at the "lane," "sample," and "experiment" levels to identify subsets of the samples with low quality, thereby preventing the loss of complete samples when reliable data can still be obtained.
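As an illustration of the per-file statistics named in the Results (read count, read length, quality score, and high-quality bases), the following is a minimal Python sketch. It is not FQStat's implementation. The function names, the Phred+33 quality encoding, and the Q30 cutoff for a "high-quality" base are all assumptions made for this example, and the parser assumes simple four-line FASTQ records.

```python
from typing import Dict, Iterable, Iterator, Tuple

PHRED_OFFSET = 33       # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding
HIGH_QUALITY_MIN = 30   # assumption: illustrative cutoff, not FQStat's setting


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality) pairs from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").strip()
        yield seq, qual


def fastq_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute simple quality statistics for one FASTQ input (hypothetical helper)."""
    reads = bases = quality_sum = high_quality = 0
    for seq, qual in read_fastq(lines):
        reads += 1
        bases += len(seq)
        for ch in qual:
            q = ord(ch) - PHRED_OFFSET  # decode one Phred score
            quality_sum += q
            if q >= HIGH_QUALITY_MIN:
                high_quality += 1
    return {
        "read_count": reads,
        "mean_read_length": bases / reads if reads else 0.0,
        "mean_quality": quality_sum / bases if bases else 0.0,
        "high_quality_fraction": high_quality / bases if bases else 0.0,
    }


if __name__ == "__main__":
    example = "@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n"
    print(fastq_stats(example.splitlines()))
```

Because each file is summarized independently, a function of this shape could in principle be run on several files at once, which is the kind of batch setting the abstract describes. How FQStat itself divides cores and memory among files is not shown here.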